Latent MOE & Repeated MTP support for NemotronH; fix KV cache quant export #830
base: main
Conversation
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Important: Review skipped. Auto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI.
📝 Walkthrough

This PR extends Megatron Core model export and quantization with support for grouped MLP merging, self-attention scaling, and MTP (multi-token prediction) training. It refactors the importer into modular helper methods, improves KV cache scaling-factor handling, adds HuggingFace quantization config support, enhances MoE expert synchronization, and integrates TELinear quantization.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
modelopt/torch/export/quant_utils.py (1)
387-422: KV cache dtype now treats all FP4 as affine.
Line 396 computes `is_affine`, but `_compute_kv_cache_dtype` ignores it and defaults to `True`, so `(2, 1)` always maps to `KV_CACHE_NVFP4_AFFINE` even when bias is missing. This changes export semantics.

🛠️ Suggested fix

```diff
-    return _compute_kv_cache_dtype(num_bits_list)
+    return _compute_kv_cache_dtype(num_bits_list, is_affine)

-def _compute_kv_cache_dtype(num_bits_list: list[int]) -> str | None:
+def _compute_kv_cache_dtype(num_bits_list: list[int], is_affine: bool) -> str | None:
     """Returns the kv_cache dtype.

     If num_bits of output_quantizer is (4, 3) then returns FP8;
     if it is 8, returns int8, otherwise returns None.

     Args:
-        modules: The module or list of modules to inspect.
+        num_bits_list: List of num_bits from KV quantizers.
+        is_affine: Whether KV quantizers have bias/affine params.

     Returns:
         The kv_cache dtype.
     """
-    is_affine = True
-
     if (4, 3) in num_bits_list:
         return KV_CACHE_FP8
```

modelopt/torch/export/unified_export_megatron.py (1)
745-803: `exclude_modules` never populates due to the `qformat is None` check.

`get_quantization_format` returns `QUANTIZATION_NONE`, not `None`, so Line 763 never triggers and the exclude list stays empty.

🛠️ Suggested fix

```diff
-            if qformat is None and "norm" not in prefix:  # Add exclude layers for hf_quant_config
+            if qformat == QUANTIZATION_NONE and "norm" not in prefix:  # Add exclude layers
                 self.exclude_modules.append(prefix)
```
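To make the failure mode concrete, here is a tiny standalone illustration; the sentinel's actual value and the module prefix are assumptions, and only the check itself mirrors the comment above:

```python
QUANTIZATION_NONE = "none"  # assumed sentinel value; the real constant lives in quant_utils

exclude_modules: list[str] = []
prefix = "model.layers.0.mlp.router"  # hypothetical module prefix
qformat = QUANTIZATION_NONE           # an unquantized layer reports the sentinel, not None

if qformat is None and "norm" not in prefix:               # original check: never fires
    exclude_modules.append(prefix)
if qformat == QUANTIZATION_NONE and "norm" not in prefix:  # fixed check: fires as intended
    exclude_modules.append(prefix)

print(exclude_modules)  # ['model.layers.0.mlp.router']
```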
🤖 Fix all issues with AI agents
In `@modelopt/torch/export/plugins/mcore_nemotron.py`:
- Line 103: The inline comment "# TODO ADD MTP export" is stale because the
export mapping in this module now contains MTP entries; remove the TODO or
replace it with a precise note about any remaining work (e.g., which specific
MTP formats or tests are still missing) so readers aren’t misled—update the
comment near the export mapping and any related MTP export code references to
reflect current status.
In `@modelopt/torch/export/plugins/megatron_importer.py`:
- Around line 264-290: The grouped MLP loader (_grouped_mlp_merging) currently
skips any per-expert quantization scales (TODO weight_scale), so when importing
quantized grouped MLPs you must detect and copy the scale tensors into the
module state_dict before calling module.load_state_dict. Modify
_grouped_mlp_merging to (1) inspect the existing state_dict keys for the
quantizer scale key pattern used by TEGroupedMLP (e.g.,
"weight_quantizer._scale" or similar) and (2) for each expert_id call
self._get_safetensor with the matching safetensor key for that expert's scale
(e.g., use the same prefix.format(expert_id) + the scale suffix used in the
checkpoint) and assign that tensor into state_dict under the exact key name
expected by the module (mirroring how you already set
state_dict[f"weight{expert_id}"] = tensor for weights); finally preserve
dtype/shape checks and then call module.load_state_dict(state_dict) as before.
- Around line 693-774: The non-repeated MTP branch never increments or
initializes layer_id so every mtp in the for mtp in model.mtp.layers loop reuses
the same prefix; fix by initializing layer_id (e.g., layer_id = 0) before the
loop and incrementing layer_id at the end of each iteration of the for mtp in
model.mtp.layers loop so subsequent calls to self.rules["mtp.*"](...) use unique
layer indices.
In `@modelopt/torch/export/quant_utils.py`:
- Around line 348-367: The function get_kv_cache_scaling_factor currently
returns a list of two scalars but is annotated as torch.Tensor; update its type
hint to reflect the actual return type (for example List[Optional[torch.Tensor]]
or Tuple[Optional[torch.Tensor], Optional[torch.Tensor]]) and add necessary
typing imports (List, Optional, Tuple) if not present, and adjust the docstring
return description accordingly; keep the implementation that calls
get_scaling_factor on attributes k_bmm_quantizer and v_bmm_quantizer unchanged.
In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 1248-1255: The type annotation for moe_router_dtype is
inconsistent: export_mcore_gpt_to_hf declares it as str | None but
GPTModelExporter.__init__ and import_mcore_gpt_from_hf use torch.dtype | None;
change GPTModelExporter.__init__ signature to use str | None to match
export_mcore_gpt_to_hf and the implementation that expects string values like
"fp32"/"fp64", update import_mcore_gpt_from_hf to also use str | None for
moe_router_dtype, and add a brief docstring entry for moe_router_dtype in
export_mcore_gpt_to_hf describing allowed string values and meaning.
🧹 Nitpick comments (3)
modelopt/torch/quantization/plugins/megatron.py (1)
49-58: Make TELinear import optional to handle older Megatron-Core versions gracefully.

TELinear is available in Megatron-Core v0.9.0+ (Oct 2024), but the current import structure causes the entire TE try-except block to fail if any module is missing, disabling all TE quantization support. While TELinear is now stable, making it optional with a nested try-except is defensive programming that preserves TE support for environments using older Megatron-Core versions. Similarly, guard the `_QuantTEMCoreLinear` registration with an additional check for TELinear availability.

Suggested change

```diff
 try:
     from megatron.core.extensions.transformer_engine import (
         TEColumnParallelGroupedLinear,
         TEColumnParallelLinear,
         TEDotProductAttention,
         TELayerNormColumnParallelLinear,
-        TELinear,
         TERowParallelGroupedLinear,
         TERowParallelLinear,
     )
+    try:
+        from megatron.core.extensions.transformer_engine import TELinear
+    except ImportError:
+        TELinear = None

     from .transformer_engine import _QuantTEGroupedLinear, _QuantTELayerNormLinear, _QuantTELinear

     HAS_TE = True
 except ImportError:
     HAS_TE = False
+    TELinear = None

-if HAS_TE:
+if HAS_TE and TELinear is not None:

     @QuantModuleRegistry.register({TELinear: "te_mcore_Linear"})
     class _QuantTEMCoreLinear(_QuantTELinear):
         pass
```

modelopt/torch/export/unified_export_megatron.py (2)
113-158: Handle `moe_router_dtype` inputs consistently and avoid unconditional prints.

Line 113 still types `torch.dtype`, but the implementation only handles string inputs; callers passing `torch.float32` now silently get `None`. Also, Line 129 prints unconditionally. Consider accepting both types and gating logs.

♻️ Suggested refactor

```diff
-        moe_router_dtype: torch.dtype | None = None,
+        moe_router_dtype: str | torch.dtype | None = None,
@@
-        self.moe_router_dtype = None
-        if moe_router_dtype == "fp32":
+        self.moe_router_dtype = None
+        if isinstance(moe_router_dtype, torch.dtype):
+            self.moe_router_dtype = moe_router_dtype
+        elif moe_router_dtype == "fp32":
             self.moe_router_dtype = torch.float32
         elif moe_router_dtype == "fp64":
             self.moe_router_dtype = torch.float64
-        print(f"Exporting model with moe_router_dtype: {self.moe_router_dtype}")
+        if torch.distributed.is_initialized():
+            if torch.distributed.get_rank() == 0:
+                print(f"Exporting model with moe_router_dtype: {self.moe_router_dtype}")
+        else:
+            print(f"Exporting model with moe_router_dtype: {self.moe_router_dtype}")
```
835-870: Gate debug prints behind a logger/flag.
Lines 838–870 print per‑tensor dtype for every module; this is noisy and can slow large exports.
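One possible shape for that gating, as a small hypothetical sketch (the helper and its arguments are placeholders, not the exporter's actual code):

```python
import logging

import torch

logger = logging.getLogger(__name__)


def log_tensor_dtype(name: str, tensor: torch.Tensor) -> None:
    # Emit per-tensor dtype info only when debug logging is enabled,
    # instead of printing unconditionally for every exported module.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("exporting %s with dtype %s", name, tensor.dtype)
```

Enabling DEBUG for this logger would restore the per-tensor output on demand without slowing default exports.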
```python
    def _grouped_mlp_merging(
        self,
        module,
        prefix,
        parallel_config: ParallelConfig | None = None,
        is_mtp: bool = False,
        init_expert_id: int = 0,
        num_local_experts: int = 1,
    ):
        if is_mtp:
            if "backbone" in prefix:
                prefix = prefix.replace("backbone", "mtp")
            else:
                prefix = prefix.replace("model", "mtp")

        state_dict = module.state_dict()
        # TODO handle weight_scale
        # weight_scale = state_dict.get("weight_quantizer._scale", None)

        assert module.num_gemms == num_local_experts, (
            "num_gemms must be equal to num_local_experts in TEGroupedMLP"
        )
        for expert_id in range(init_expert_id, init_expert_id + num_local_experts):
            tensor = self._get_safetensor(prefix.format(expert_id) + ".weight")
            state_dict[f"weight{expert_id}"] = tensor

        module.load_state_dict(state_dict)
```
Grouped MLP import skips weight scales.
Line 280 notes a TODO for weight_scale handling; if grouped MLPs can be quantized, this will leave scales stale.
🤖 Prompt for AI Agents
In `@modelopt/torch/export/plugins/megatron_importer.py` around lines 264 - 290,
The grouped MLP loader (_grouped_mlp_merging) currently skips any per-expert
quantization scales (TODO weight_scale), so when importing quantized grouped
MLPs you must detect and copy the scale tensors into the module state_dict
before calling module.load_state_dict. Modify _grouped_mlp_merging to (1)
inspect the existing state_dict keys for the quantizer scale key pattern used by
TEGroupedMLP (e.g., "weight_quantizer._scale" or similar) and (2) for each
expert_id call self._get_safetensor with the matching safetensor key for that
expert's scale (e.g., use the same prefix.format(expert_id) + the scale suffix
used in the checkpoint) and assign that tensor into state_dict under the exact
key name expected by the module (mirroring how you already set
state_dict[f"weight{expert_id}"] = tensor for weights); finally preserve
dtype/shape checks and then call module.load_state_dict(state_dict) as before.
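A sketch of how that could look as a helper on the importer, under the assumptions spelled out in the prompt; the scale key patterns and the checkpoint suffix below are hypothetical and must be replaced by whatever TEGroupedMLP and the checkpoint actually use:

```python
def _load_grouped_mlp_expert_scales(self, module, prefix, init_expert_id, num_local_experts):
    """Hypothetical helper: copy per-expert weight scales alongside the expert weights."""
    state_dict = module.state_dict()
    # Only attempt this when the module is quantized and exposes scale entries.
    if not any(key.endswith("weight_quantizer._scale") for key in state_dict):
        return state_dict

    for expert_id in range(init_expert_id, init_expert_id + num_local_experts):
        # Checkpoint key layout is an assumption; mirror the suffix the checkpoint really uses.
        scale = self._get_safetensor(prefix.format(expert_id) + ".weight_scale")
        # The destination key must be whatever TEGroupedMLP expects for expert `expert_id`.
        state_dict[f"weight_quantizer.{expert_id}._scale"] = scale
    return state_dict
```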
```python
        # MTP
        if hasattr(model, "mtp"):
            print("Importing MTP", flush=True)
            # MTP is the last layer in DeepSeek V3/R1
            layer_id += 1
            if len(model.mtp.layers) == 1:  # Repeated MTP
                layer_id = 0  # reset layer_id for repeated MTP
                mtp = model.mtp.layers[0]
                mtp_model_layers = mtp.mtp_model_layer.layers
                for mtp_model_layer in mtp_model_layers:
                    if isinstance(mtp_model_layer, MambaLayer):
                        self._import_mamba_layer(mtp_model_layer, layer_id, layer_pbar)
                    elif isinstance(mtp_model_layer, TransformerLayer):
                        self._import_transformer_layer(
                            mtp_model_layer, layer_id, layer_pbar, is_mtp=True
                        )
                    else:
                        raise ValueError(
                            f"Unsupported layer type during MTP import: {type(mtp_model_layer)}"
                        )
                    layer_id += 1
            else:  # non-repeated MTP
                for mtp in model.mtp.layers:
                    self.rules["mtp.eh_proj"](mtp.eh_proj, layer_id)
                    self.rules["mtp.enorm"](mtp.enorm, layer_id)
                    self.rules["mtp.hnorm"](mtp.hnorm, layer_id)
                    self.rules["mtp.input_layernorm"](
                        mtp.decoder.layers[0].input_layernorm, layer_id
                    )
                    if hasattr(mtp.decoder.layers[0].self_attention, "linear_q_proj"):
                        self.rules["mtp.linear_q_proj"](
                            mtp.decoder.layers[0].self_attention.linear_q_proj, layer_id
                        )
                    else:
                        self.rules["mtp.linear_q_down_proj"](
                            mtp.decoder.layers[0].self_attention.linear_q_down_proj, layer_id
                        )
                        self.rules["mtp.linear_q_layernorm"](
                            mtp.decoder.layers[0].self_attention.q_layernorm, layer_id
                        )
                        self.rules["mtp.linear_q_up_proj"](
                            mtp.decoder.layers[0].self_attention.linear_q_up_proj, layer_id
                        )
                    self.rules["mtp.linear_kv_down_proj"](
                        mtp.decoder.layers[0].self_attention.linear_kv_down_proj, layer_id
                    )
                    self.rules["mtp.linear_kv_layernorm"](
                        mtp.decoder.layers[0].self_attention.kv_layernorm, layer_id
                    )
                    self.rules["mtp.linear_kv_up_proj"](
                        mtp.decoder.layers[0].self_attention.linear_kv_up_proj, layer_id
                    )
                    self.rules["mtp.linear_proj"](
                        mtp.decoder.layers[0].self_attention.linear_proj, layer_id
                    )
                    self.rules["mtp.pre_mlp_layernorm"](
                        mtp.decoder.layers[0].pre_mlp_layernorm, layer_id
                    )
                    self.rules["mtp.router"](mtp.decoder.layers[0].mlp.router, layer_id)
                    self.rules["mtp.shared_experts.linear_fc1"](
                        mtp.decoder.layers[0].mlp.shared_experts.linear_fc1, layer_id
                    )
                    self.rules["mtp.shared_experts.linear_fc2"](
                        mtp.decoder.layers[0].mlp.shared_experts.linear_fc2, layer_id
                    )
                    for expert_id, expert in tqdm(
                        enumerate(mtp.decoder.layers[0].mlp.experts.local_experts),
                        desc="Importing MoE local experts",
                        leave=False,
                        disable=self.disable_tqdm,
                    ):
                        self.rules["mtp.local_experts.linear_fc1"](
                            expert.linear_fc1, layer_id, expert_id
                        )
                        self.rules["mtp.local_experts.linear_fc2"](
                            expert.linear_fc2, layer_id, expert_id
                        )
```
Non‑repeated MTP import reuses the same layer_id for every layer.
layer_id is never updated in the non‑repeated branch, so each iteration overwrites the same prefix.
🛠️ Suggested fix

```diff
-            else:  # non-repeated MTP
-                for mtp in model.mtp.layers:
+            else:  # non-repeated MTP
+                for mtp_id, mtp in enumerate(model.mtp.layers):
+                    layer_id = mtp_id
                     self.rules["mtp.eh_proj"](mtp.eh_proj, layer_id)
                     self.rules["mtp.enorm"](mtp.enorm, layer_id)
                     self.rules["mtp.hnorm"](mtp.hnorm, layer_id)
```

🤖 Prompt for AI Agents
In `@modelopt/torch/export/plugins/megatron_importer.py` around lines 693 - 774,
The non-repeated MTP branch never increments or initializes layer_id so every
mtp in the for mtp in model.mtp.layers loop reuses the same prefix; fix by
initializing layer_id (e.g., layer_id = 0) before the loop and incrementing
layer_id at the end of each iteration of the for mtp in model.mtp.layers loop so
subsequent calls to self.rules["mtp.*"](...) use unique layer indices.
```diff
 def export_mcore_gpt_to_hf(
     model: torch.nn.Module,
     pretrained_model_name_or_path: str | os.PathLike | None = None,
     export_extra_modules: bool = False,
     dtype: torch.dtype = torch.bfloat16,
     export_dir: Path | str = tempfile.gettempdir(),
-    moe_router_dtype: torch.dtype | None = None,
+    moe_router_dtype: str | None = None,
 ):
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Verify call sites and parameter usage for the signature change.
rg -n --type=py 'export_mcore_gpt_to_hf\('
rg -n --type=py 'GPTModelExporter\('
```

Repository: NVIDIA/Model-Optimizer
Length of output: 528
🏁 Script executed:

```bash
#!/bin/bash
# Examine the test call site
echo "=== Test call site (line 94) ==="
sed -n '85,105p' tests/gpu/torch/export/test_unified_export_megatron.py

# Examine the function and GPTModelExporter instantiation
echo "=== Function definition and GPTModelExporter instantiation ==="
sed -n '1248,1280p' modelopt/torch/export/unified_export_megatron.py

# Check if there's a docstring with parameter documentation
echo "=== Full function with docstring ==="
sed -n '1248,1300p' modelopt/torch/export/unified_export_megatron.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 4205
🏁 Script executed:

```bash
#!/bin/bash
# Find GPTModelExporter class definition
echo "=== GPTModelExporter class definition ==="
rg -n 'class GPTModelExporter' --type=py

# Get the class definition and __init__ method
echo "=== GPTModelExporter __init__ ==="
rg -A 20 'class GPTModelExporter' --type=py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 2142
🏁 Script executed:

```bash
#!/bin/bash
# Get the full __init__ method of GPTModelExporter
sed -n '100,150p' modelopt/torch/export/unified_export_megatron.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 2574
Fix type annotation inconsistency for moe_router_dtype parameter.
The export_mcore_gpt_to_hf function signature correctly uses str | None, but GPTModelExporter.__init__ incorrectly declares it as torch.dtype | None. The implementation in GPTModelExporter (lines 127-130) expects string values like "fp32" and "fp64", not torch.dtype objects. Update GPTModelExporter.__init__ to use str | None to match the actual implementation and the public API.
Additionally, the moe_router_dtype parameter is missing from the docstring of export_mcore_gpt_to_hf, and the type signature for moe_router_dtype in the related import_mcore_gpt_from_hf function uses torch.dtype | None, creating inconsistency across the export/import interface.
🤖 Prompt for AI Agents
In `@modelopt/torch/export/unified_export_megatron.py` around lines 1248 - 1255,
The type annotation for moe_router_dtype is inconsistent: export_mcore_gpt_to_hf
declares it as str | None but GPTModelExporter.__init__ and
import_mcore_gpt_from_hf use torch.dtype | None; change
GPTModelExporter.__init__ signature to use str | None to match
export_mcore_gpt_to_hf and the implementation that expects string values like
"fp32"/"fp64", update import_mcore_gpt_from_hf to also use str | None for
moe_router_dtype, and add a brief docstring entry for moe_router_dtype in
export_mcore_gpt_to_hf describing allowed string values and meaning.
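For illustration, a small hypothetical sketch of the string-to-dtype mapping the exporter is described as implementing; the helper name is made up, and only the "fp32"/"fp64" values come from the review above:

```python
import torch

_MOE_ROUTER_DTYPES = {"fp32": torch.float32, "fp64": torch.float64}


def resolve_moe_router_dtype(moe_router_dtype: str | None) -> torch.dtype | None:
    # None (or an unrecognized string) leaves the router dtype unset.
    return _MOE_ROUTER_DTYPES.get(moe_router_dtype) if moe_router_dtype else None
```

With this shape, `resolve_moe_router_dtype("fp32")` returns `torch.float32` and `resolve_moe_router_dtype(None)` returns `None`, which is why a `str | None` annotation matches the behavior better than `torch.dtype | None`.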
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
a87e6a9 to e41783a (Compare)
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #830      +/-   ##
==========================================
- Coverage   73.82%   73.38%   -0.45%
==========================================
  Files         193      193
  Lines       19745    19894     +149
==========================================
+ Hits        14577    14599      +22
- Misses       5168     5295     +127
```

☔ View full report in Codecov by Sentry.
```python
        # Step 1: Sync amax across local experts in a SequentialMLP
        for name, module in model.named_modules():
            if hasattr(module, "sync_moe_local_experts_amax"):
                module.sync_moe_local_experts_amax()
```
This function says to sync across EP and ETP but I didn't see any collective communication happen in the function. The description needs to be changed I think.
I can change the name of the method. What it's actually doing is just looping through the local experts and assigning the same amax.
it is syncing local experts for sequential moe.
the name does not say or indicate it is collective comm.
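A rough sketch of what the discussion describes, i.e. equalizing amax across the local experts of a SequentialMLP with no collective communication; the attribute names and the use of the per-expert maximum are assumptions for illustration, not the repository's exact implementation:

```python
import torch


def sync_moe_local_experts_amax(local_experts) -> None:
    """Assign one shared amax to every local expert's weight quantizer (no collectives)."""
    amaxes = [
        expert.weight_quantizer.amax
        for expert in local_experts
        if expert.weight_quantizer.amax is not None
    ]
    if not amaxes:
        return
    # Take the elementwise max across experts so no expert's range is clipped.
    shared_amax = torch.stack([torch.as_tensor(a) for a in amaxes]).max(dim=0).values
    for expert in local_experts:
        expert.weight_quantizer.amax = shared_amax
```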
I have concerns about "Remove KV quantization scale clamping to 1" without accuracy data to support it. This also deviates from the HF PTQ code path. I think we need to have a discussion about the impact.
realAsma left a comment
LGTM! I did not look at export code.
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
What does this PR do?
Type of change: New feature and bug fix
Overview:
Support Latent MOE and Repeated MTP for NemotronH models
Fix `qkv_layer.output_quantizer` export & replace with proper `k/v_bmm_quantizer` logic

Usage
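A hypothetical usage sketch based on the `export_mcore_gpt_to_hf` signature shown in the review above; the model handle and path arguments are placeholders, not part of this PR:

```python
import torch

from modelopt.torch.export.unified_export_megatron import export_mcore_gpt_to_hf

# `model` is assumed to be an already-built (and optionally quantized) Megatron-Core
# GPT/NemotronH model; the paths below are placeholders.
export_mcore_gpt_to_hf(
    model,
    pretrained_model_name_or_path="path/to/hf/config",
    dtype=torch.bfloat16,
    export_dir="/tmp/hf_export",
    moe_router_dtype="fp32",  # string form, per the signature change in this PR
)
```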
```python
# Add a code snippet demonstrating how to use this
```

Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Refactor